In [1]:
%pylab inline
In [2]:
import nltk
import os
from pprint import pprint
Let's scale up from the individual text to whole collections. In text analysis, the term "corpus" is commonly used to refer to such a collection of texts.
The concept of a corpus is helpful because it refers not just to a bunch of texts, but also points to the act of purposeful aggregation that brought those texts together. For example, we use the term "corpus" to refer to the collected works of a particular author. For your research, you may aggregate a corpus based on thematic or temporal criteria which are informed by your research questions and theoretical assumptions. When analyzing a corpus, or interpreting the results of such an analysis, it is important to keep in mind its provenance -- not just its contents.
In [4]:
text_root = '../../data/EmbryoProjectTexts/files'
try:
    assert os.path.exists(text_root)
except AssertionError:
    print "That directory doesn't exist!"
In [24]:
documents = nltk.corpus.PlaintextCorpusReader(text_root, 'https.+')
In [25]:
documents.words()
Out[25]:
In [7]:
wordnet = nltk.WordNetLemmatizer()
from nltk.corpus import stopwords
stoplist = stopwords.words('english')
def normalize_token(token):
    """
    Convert ``token`` to lowercase, and lemmatize it using the WordNet lemmatizer.

    Parameters
    ----------
    token : str

    Returns
    -------
    token : str
    """
    return wordnet.lemmatize(token.lower())
def filter_token(token):
    """
    Evaluate whether or not to retain ``token``.

    Parameters
    ----------
    token : str

    Returns
    -------
    keep : bool
    """
    token = token.lower()
    return token not in stoplist and token.isalpha() and len(token) > 2
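As a quick sanity check (the example tokens below are our own, not drawn from the corpus, and assume the standard NLTK stopword and WordNet data are installed), we can see what these helpers do:
In [ ]:
print normalize_token('Embryos')   # -> 'embryo': lowercased and lemmatized
print filter_token('the')          # -> False: 'the' is a stopword
print filter_token('Embryos')      # -> True: alphabetic, longer than 2 characters, not a stopword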
We can get the frequency of tokens in our corpus just like we did for a single text, using a FreqDist (frequency distribution).
In NLTK, frequencies and probabilities are usually discussed in terms of "experiments". A frequency distribution records the frequency of specific outcomes (samples) of a repeated experiment. In this case, we are sampling tokens from a text.
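As a minimal illustration (the toy token list below is invented, not taken from the corpus), a FreqDist simply counts how often each outcome occurs:
In [ ]:
# Each token in the list is one outcome of the "experiment".
toy_tokens = ['embryo', 'cell', 'embryo', 'development', 'cell', 'embryo']
toy_counts = nltk.FreqDist(toy_tokens)
print toy_counts['embryo']        # -> 3
print toy_counts.most_common(2)   # -> [('embryo', 3), ('cell', 2)]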
In [26]:
word_counts = nltk.FreqDist([normalize_token(token)
                             for token in documents.words()
                             if filter_token(token)])
In [27]:
word_counts.plot(20)
In [10]:
document_counts = nltk.FreqDist([
    token    # Each token will be counted a maximum of 1 time per text.
    for fileid in documents.fileids()
    for token in set(    # There can be no duplicates in a set.
        [normalize_token(token)    # Normalize first!
         for token
         in documents.words(fileids=[fileid])
         if filter_token(token)]
    )
])
In [14]:
document_counts.plot(70)
In the figure above, we can see that the top ~40 words occur in around 630 texts. We can see precise values using the most_common() function:
In [15]:
document_counts.most_common(10) # Get the 10 most common words.
Out[15]:
It can be useful to examine the number of texts in which a word occurs, to get a better picture of its distribution over the corpus. We can use a FreqDist for this, too.
We modify our logic slightly: for each document, we convert the list of normalized/filtered tokens into a set. This means that each word is counted at most once per text, even if it occurs several times in that text.
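To see why the set() matters, here is a tiny sketch with two invented mini-documents (the word lists below are made up for illustration):
In [ ]:
doc_a = ['cell', 'cell', 'embryo']   # 'cell' occurs twice in this document.
doc_b = ['cell', 'development']

# Counting raw tokens: 'cell' is counted three times in total.
token_counts = nltk.FreqDist(doc_a + doc_b)
print token_counts['cell']   # -> 3

# Counting once per document: 'cell' is counted in two documents.
doc_counts = nltk.FreqDist([token for doc in [doc_a, doc_b] for token in set(doc)])
print doc_counts['cell']     # -> 2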
It turns out that there are 628 texts in this corpus...
So these are words that occur in every single text in the corpus.
In [28]:
len(documents.fileids())
Out[28]:
In computational humanities, it is very unusual to analyze a corpus without reference to at least some minimal metadata. The Python package called Tethne provides some useful mechanisms for importing metadata from Zotero RDF and other bibliographic formats.
In [18]:
from tethne.readers import zotero
zotero_export_path = '../../data/EmbryoProjectTexts'
metadata = zotero.read(zotero_export_path, index_by='link', follow_links=False)
Since we indexed our metadata using the "link" field, we can look up metadata for each text using its fileid.
In [29]:
example_fileid = documents.fileids()[0]
print 'This is the fileid:', example_fileid, '\n'
print 'This is the metadata for this fileid:', '\n'
pprint(metadata[example_fileid].__dict__) # pprint means "pretty print".
We can use metadata to add dimensionality to our texts. To examine how tokens are distributed across authors or over time, we can use a ConditionalFreqDist (conditional frequency distribution). Just like the FreqDist, the ConditionalFreqDist records the outcomes (samples) of an experiment. A ConditionalFreqDist also records a label, or condition, for each outcome.
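As a minimal sketch (the author names and tokens below are made up), a ConditionalFreqDist is built from (condition, sample) pairs:
In [ ]:
# Each entry is a (condition, sample) pair: here, (author, token).
toy_pairs = [('Author A', 'organism'), ('Author A', 'ivf'),
             ('Author B', 'organism'), ('Author A', 'organism')]
toy_dist = nltk.ConditionalFreqDist(toy_pairs)
print toy_dist['Author A']['organism']   # -> 2
toy_dist.tabulate()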
In the example below, we examine the word usage of different authors. The conditions are the author names, and the samples are tokens. We will limit our analysis to four specific tokens: 'organism', 'ivf', 'pluripotent', 'supreme' (doing this for all tokens would be pretty costly).
In [20]:
focal_tokens = ['organism', 'ivf', 'pluripotent', 'supreme']
authorDist = nltk.ConditionalFreqDist([
    (str(author[0]), normalize_token(token))    # (condition, sample)
    for fileid in documents.fileids()
    for token in documents.words(fileids=[fileid])
    for author in metadata[fileid].authors
    if filter_token(token)
    and normalize_token(token) in focal_tokens
])
In [21]:
authorDist.tabulate()
We can also use a ConditionalFreqDist to see how tokens are distributed over time. This works just like the distribution of words over authors, except that in this case we will treat our tokens as conditions. Think of it like this: each time we encounter one of the tokens in our list of focal tokens (conditions), we sample the publication date of the text from which it was drawn.
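Here is the same idea in miniature (the tokens and years below are invented): the tokens are now the conditions, and the publication dates are the samples.
In [ ]:
# (condition, sample) pairs: here, (token, publication year).
toy_pairs = [('organism', 2008), ('organism', 2010), ('ivf', 2010)]
toy_time = nltk.ConditionalFreqDist(toy_pairs)
print toy_time['organism'][2008]   # -> 1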
In [22]:
focal_tokens = ['organism', 'ivf', 'pluripotent', 'supreme']
timeDist = nltk.ConditionalFreqDist([
    (normalize_token(token), metadata[fileid].date)    # (condition, sample)
    for fileid in documents.fileids()
    for token in documents.words(fileids=[fileid])
    if filter_token(token)
    and normalize_token(token) in focal_tokens
])
In [23]:
timeDist.plot()